Compiler Optimizations for Cache Locality and Coherence
Abstract
Almost every modern processor is designed with a memory hierarchy organized into several levels, each of which is smaller, faster, and more expensive than the level below. High performance requires effective use of cached data, i.e., cache locality. Smart compiler transformations can relieve the programmer of hand-optimizing for specific machine architectures. In a multiprocessor system, data inconsistency may occur between memory and caches. For example, the memory and multiple caches may hold inconsistent copies of the same cache block. This introduces the problem of cache coherence. Several cache coherence protocols have been developed to maintain data coherence across multiple processors. Because multiple variables may be located in the same cache block, false sharing can occur, which many researchers have identified as a major obstacle to high performance. Therefore, in a multiprocessor system, we need to avoid false sharing as well as exploit cache locality. In this paper, we first develop a new data reuse model and an algorithm called height reduction to improve cache locality. The advantage of this algorithm is that it can improve band matrix programs as well as dense matrix programs. It is more accurate and general than existing techniques for improving cache locality, which were developed to optimize dense matrix programs. Then, with the height reduction algorithm, we extend loop tiling to exploit not only intra-tile data locality but also inter-tile data locality. We call the new tiling affinity tiling. Our experiments show that affinity tiling is less sensitive to the choice of the tile size. Finally, we show that the algorithm also helps to eliminate or reduce false sharing in multiprocessor systems. With the height reduction algorithm and affinity tiling, significant performance improvements (speedups from 2.5 to 10) have been observed on HP workstations and KSR1 multiprocessors.
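For readers unfamiliar with loop tiling, the following is a minimal sketch of classic rectangular tiling applied to a matrix multiply. It is not the height reduction algorithm or affinity tiling, whose details the abstract does not give. The function names and the `tile` parameter are illustrative assumptions, and Python is used only to show the loop structure; a compiler would apply the transformation to a C or Fortran loop nest, where the cache effect is actually measurable.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c[i][j] += a[i][k] * b[k][j]."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same computation with the j and k loops tiled (blocked).

    Each (jj, kk) pair selects a tile-by-tile sub-block of b that is
    reused for every row i before the next sub-block is touched.
    """
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, tile):        # tile origin along j
        j_end = min(jj + tile, n)       # clamp the last, partial tile
        for kk in range(0, n, tile):    # tile origin along k
            k_end = min(kk + tile, n)
            for i in range(n):
                for j in range(jj, j_end):
                    for k in range(kk, k_end):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```

Tiling only reorders the iterations, so both functions return the same matrix. In this sketch the tile size is an arbitrary, hand-chosen parameter.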
Similar resources
Exploiting Spatial Store Locality Through Permission Caching in Software DSMs
Fine-grained software-based distributed shared memory (SW-DSM) systems typically maintain coherence with in-line checking code at load and store operations to shared memory. The instrumentation overhead of this added checking code can be severe. This paper (1) shows that most of the instrumentation overhead in the fine-grained SW-DSM system DSZOOM is store-related, (2) introduces a new write per...
Optimizing Instruction Cache Performance for Operating System Intensive Workloads
High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to use an optimizing compiler to minimize cache interference via an improved layout of the code. This technique, however, has been applied to application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patte...
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP
Exploiting cache locality of parallel programs at runtime is a complementary approach to a compiler optimization. This is particularly important for those applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and targeted ar...
Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements
A common performance problem faced by today's application programs is poor data locality. Real-world applications are often written to traverse data structures in a manner that results in data cache miss overhead. These data structures are often declared as structs in C and classes in C++. Compiler optimizations try to modify the layout of such data structures so that they are accessed in a mor...
Evaluation, Implementation and Performance of Write Permission Caching in the DSZOOM System
Fine-grained software-based distributed shared memory (SW-DSM) systems typically maintain coherence with in-line checking code at load and store operations to shared memory. The instrumentation overhead of this added checking code can be severe. This paper (1) shows that most of the instrumentation overhead in the fine-grained DSZOOM SW-DSM system is store-related, (2) introduces a new write per...
Publication date: 1994